Method and system for automatic non-disruptive problem determination and recovery of
communication link problems.

In the method for problem determination and
recovery of a failing resource on a communication
link segment in a data communication network, the
using node (6) passes link event data to the
communication link manager (2) for analysis when a
problem occurs on a link segment. The communication
link manager (2) interacts with a configuration data
base (4) to determine the physical configuration of
the failing link segment and the controlling link
connection subsystem manager (3). The communication
link manager (2) directs the appropriate link
connection subsystem manager (3) to initiate tests of
the various link connection components on the link
segment under its control. When the failing resource
is identified, the communication link manager (2)
initiates the appropriate non-disruptive recovery
procedure through the link connection subsystem
manager (3) and prompts the data link control to
restart the line.
METHOD AND SYSTEM FOR AUTOMATIC NON-DISRUPTIVE PROBLEM DETERMINATION AND RECOVERY

OF COMMUNICATION LINK PROBLEMS 



This invention relates to data communication 
network management, and particularly to a method 
for automatic non-disruptive problem determination
and recovery of communication link problems.

Numerous prior network link management
problem diagnosis methods and devices exist in
the field. These are generally designed in a manner
specific to the particular physical devices that
make up the data communication links. In telephone
systems in general, loop back signals may
be transmitted from a controlling node to the various
elements making up the communication link to
a target node, and signals may be propagated
down the communication link to be looped back by
devices in the line which are operating correctly
and which can respond to the loop back commands
to the control node. Such processes are,
however, totally inappropriate when the data
communication network devices operate under differing
protocols, are supplied by different vendors and
have differing diagnostic data providing capabilities.

More sophisticated techniques for problem di- 
agnosis are also in existence where the data com- 
munication network contains devices supplied by 
the vendors of the host data node or control node 
systems. For example, the IBM Corporation mar- 
kets components for a data communications sys- 
tem including host system mainframe computers, 
operator control consoles and communication net- 
work control program applications that can commu- 
nicate with and diagnose problems occurring in a 
data communication network that includes IBM
supplied modems, multiplexers, switches and the 
like. However, these systems share a common 
system architecture and communication protocol 
and are thus amenable to a single diagnostic ap- 
plication program and protocol. They avoid the 
difficulty encountered as mentioned above, with 
numerous protocols in numerous physical devices 
supplied by diverse vendors. Such uniform sys- 
tems are, however, not often achieved and in many 
networks constructed by users and purchasers of 
various components, such uniformity cannot be 
brought about without major reinvestment. 

In the prior art, the recovery process is usually
a manual process which is disruptive to the end 
user. Recovery techniques have required the net- 
work operator to manually intervene and circum- 
vent network link problems by switching to back-up 
components or by using alternate paths through 
the network with the high overhead costs asso- 
ciated with re-establishing communications be- 
tween network resources. 

Thus, the major problem the users and the
purchasers are faced with is that of achieving high
availability of these multiple layer, multiple vendor,
multiple protocol networks. Once a failure occurs, it
takes time for the operator to react to the problem,
the end user usually being the first to notice the
problem. Once it is determined that a problem
does exist, problem determination must be performed
to determine which recovery procedure
provides the fastest restoration of service to the end
user.

It is therefore desirable that some method and 
system apparatus be provided that is capable of 
accommodating the more usually encountered data 
communication networks which are constructed of 
multiple layers of various vendors' physical devices
having diverse capabilities and communication pro- 
tocols. 

It is an object of this invention to provide an
improved method and system for non-disruptive
problem determination and recovery in data
communication networks. In a layered system, the
system overhead of session recovery is reduced by
performing recovery at the layer where the failure
is detected instead of passing the failing indication
to a higher layer. The approach is to modify data
link control to allow problem determination and
recovery to be completed before sending the failing
(INOP) notification. If recovery is successful,
then no INOP is sent. Instead, an alert is sent
indicating that a failure occurred and that recovery
was successful.

It is a still further object to provide an improved
diagnostic method and system in which the
communication link manager, having access to data
concerning the physical configuration and
characteristics of the network components which
comprise a link to an identified target node
experiencing a problem, issues problem determination
requests to an intermediate translation facility. This
facility translates the requests for diagnoses into
device-specific commands addressed to particular
devices in the link and receives from such devices
specific responses, which are communicated to the
communication link manager for problem resolution
and application of the appropriate recovery procedure.

EP 0 403 414 A2

The foregoing and still other objects not
specifically enumerated are provided by the present
invention in an architecturally designed system
utilizing a novel communications technique. The
system comprises a communication link manager that
is responsible for problem determination and
recovery of the link connection subsystem. The data
communications system contains an intermediate
control and translation facility that receives the
problem determination request from the
communication link manager. The communication link
manager has access to a dynamic physical
configuration data base that constitutes a network map
and contains information for each link under the 
communication link manager's responsibility. This 
data base identifies the salient characteristics for 
each physical device constituting each communica- 
tion link within the purview of the communication 
link manager. These characteristics include the 
communication protocol requirements, the capabil- 
ities and physical location or addresses for each 
such device, spare resources that are available for 
automatic recovery and the components of the 
translation facility responsible for the various link 
segments. The communication link manager ac- 
cesses these data files for an identified target node 
and link in response to a failure indication from a
using node. The translation control facility then 
issues one or more device-specific addressed 
problem isolation, determination or control com- 
mands onto the communication link. It receives 
device-specific responses from devices which it is 
capable of accessing utilizing the information from 
the physical configuration data base. This inter- 
mediate translation and control facility, under direc- 
tion of the communication link manager, may then 
issue still further commands to specific physical 
devices until the problem situation has been
isolated and identified. This information is
communicated back to the communication link manager
which responds with a command to the translation 
facility to initiate recovery using specific proce- 
dures and, if necessary, provide information on 
ports to swap. The translation facility effects the 
recovery procedure and informs the communication 
link manager that recovery was successful, or that 
recovery was attempted but failed. If recovery is 
successful, an alert is sent to the host indicating
that a failure occurred and that recovery was suc- 
cessful. 

This invention will be described with respect to 
a preferred embodiment thereof which is further 
illustrated and described in drawings. 

Figure 1 illustrates a schematic data commu- 
nication system or network constructed in accor- 
dance with the present invention as a preferred 
embodiment thereof. 

Figure 1A illustrates the logical structure at
the using node as embodied in the present invention.

Figure 2 illustrates the structure of the using
node link connection manager, also referred to as
data link control, and the interface between the
communication link manager and the using node in
the present invention.

Figure 3 is an automatic problem determination
and recovery flow chart showing how the
communication link manager conducts the diagnostic
processes and communicates with various link
components through a link connection subsystem
manager to effect the recovery of failed link
components.

Figure 4 illustrates a general communications
network and indicates the link connection
components which can be subjected to automatic
recovery procedures.

Figure 5 illustrates the various configurations
supported by the preferred embodiment of this
invention.

Figure 6 illustrates an example configuration
supported by the preferred embodiment of this
invention.

Figure 7 illustrates a flow chart indicating the
algorithm for isolating a connection link problem
and applying the appropriate recovery procedure
for the link configuration depicted in Figure 6.

The present invention provides a method and
system for allowing a communication link manager
running in a data communication system to provide
problem determination and recovery in the physical
computer data communications links and connections.
The invention is conceptually and schematically
illustrated in Figure 1.

The communication link manager (CLM) 2 illustrated
in Figure 1 is responsible for problem determination
and recovery of the link connection subsystem.
The CLM 2 receives input from two logical
components of the using node 6, the using node
control 22 and the link connection manager 20, as
well as from the service system 3 containing the
link connection subsystem managers (hereinafter
LCSMs). The CLM 2 coordinates all problem
determination and recovery dealing with the link
connection subsystem, including the control of multiple
LCSMs 3. The CLM 2 obtains the complete link
connection configuration from the link connection
configuration manager (referred to as LCCM 4),
including all link connection components (LCCs) 10 in the
link connection path, the LCSMs 3 responsible for
the components, and the backup components to
use if recovery is necessary. The CLM 2 contains
unique logic for the different link connection
configurations supported.

The using node 6 is responsible for detecting
link connection problems. It must determine if the
problem is caused by internal hardware or software
errors or by the link connection. Using node 6 can
be logically decomposed into two functions: a using
node control 22 and a link connection manager
20, as shown in Figure 1A. The link connection
manager 20, also referred to as using node data
link control, modifies the error recovery procedure
for consecutive errors. Three retries are used for
each resource that has been active. The link
connection manager 20 notifies the communications
link manager 2 via using node control 22 whether
all active resources or just one resource is failing. It
also provides link connection notification to CLM 2
when a resource becomes active. This notification
includes the entry and exit addresses and path
information if appropriate. This allows dynamic
configuration changes to be reflected in the
configuration data base 4. The link connection manager 20
is responsible for problem isolation of the link
connection entry point. It is also responsible for
passing link event data to the communications link
manager 2 via using node control 22 when the
problem is with the link connection.
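
The retry rule just described can be pictured with a short Python sketch. It is illustrative only: the function name `classify_failure`, the `send` callable and the scope labels are assumptions made for this example and do not appear in the original text. The only detail taken from the description is the limit of three retries per active resource.

```python
MAX_RETRIES = 3  # the description specifies three retries per active resource


def classify_failure(active_resources, send):
    """Retry each active resource and report how widely the failure spreads.

    `send` is a hypothetical callable that returns True when a transmission
    to the given resource succeeds. The result is a (scope, failing) pair,
    where scope is "none", "single", "several" or "all".
    """
    failing = []
    for resource in active_resources:
        # any() stops at the first successful attempt, so a healthy
        # resource is tried once and a dead one exactly MAX_RETRIES times.
        if not any(send(resource) for _ in range(MAX_RETRIES)):
            failing.append(resource)

    if not failing:
        return "none", failing
    if len(failing) == len(active_resources):
        return "all", failing
    return ("single" if len(failing) == 1 else "several"), failing
```

In this sketch the returned scope stands in for the notification that tells the CLM whether all active resources or just one resource is failing.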

The using node control 22 analyzes using node 
6 error conditions and passes them on to the CLM 
2 for automatic recovery. It is also responsible for 
passing link connection manager 20 link event con- 
ditions to CLM 2 for analysis and automatic recov- 
ery. 

The link connection configuration manager 
(LCCM) maintains the current status of each re- 
source. The LCCM 4 is responsible for complete 
link subsystem configuration data. This includes all
link connection components 10 in the link connec- 
tion path, the way these components are physically 
connected, the LCSMs 3 responsible for the com- 
ponents and the backup components to use if 
recovery from failure is necessary. 
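
As a rough illustration of the configuration data listed above, the following Python sketch models one link connection record. The class names, field names and sample component names are all hypothetical; the description does not specify a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LinkComponent:
    """One link connection component (LCC); field names are hypothetical."""
    name: str
    lcsm_id: str              # the LCSM responsible for this component
    defective: bool = False


@dataclass
class LinkConfiguration:
    """Configuration data kept by the LCCM for a single link connection."""
    path: List[LinkComponent]  # components in physical connection order
    backups: Dict[str, List[LinkComponent]] = field(default_factory=dict)

    def responsible_lcsms(self) -> List[str]:
        """Return the LCSM ids along the path, in order, without repeats."""
        seen: List[str] = []
        for lcc in self.path:
            if lcc.lcsm_id not in seen:
                seen.append(lcc.lcsm_id)
        return seen


# Sample data invented for the example.
config = LinkConfiguration(
    path=[
        LinkComponent("modem-local", "lcsm-a"),
        LinkComponent("mux-1", "lcsm-b"),
        LinkComponent("modem-remote", "lcsm-a"),
    ],
    backups={"modem-local": [LinkComponent("modem-spare", "lcsm-a")]},
)
```

Keeping the path as an ordered list reflects the requirement that the LCCM record how the components are physically connected.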

The link connection components (LCCs) 10 
perform the physical functions of the link segment 
such as statistical multiplexing, time division mul- 
tiplexing, modems, front end line switches, local 
area networks, etc. The LCC 10 has two functions 
to perform in the event of a problem. The first is to 
determine if the problem is caused by an internal 
error or with the link segment. If the problem is 
internal, the LCC 10 passes this information to the 
appropriate LCSM 3. If the error is with the link 
segment, the LCC 10 collects the link event data 
and passes this data to the LCSM 3. The LCCs 10 
also provide a data wrap of each interface. For 
example, if the modems use the same techniques 
to transmit and receive data, then the transmit data 
could be wrapped to the receive data for a particu- 
lar modem. 

The first major function of the CLM 2 is problem
determination. It receives product error data
and link event data from using node 6, requests
configuration data and component backup information
from the configuration data base 4, selects
appropriate problem determination logic depending
on the configuration returned from the LCCM 4,
invokes the appropriate LCSM 3, examines results
from application analysis, determines if other
applications (i.e., LCSMs) should be invoked, and
initiates the recovery function, which is the other
major function of CLM 2. The CLM 2 receives
requests for recovery action with the failing
component identified, determines if a recovery action is
defined and invokes an appropriate recovery action,
i.e., swap modem, swap port, switch network
backup, etc.

The CLM 2 selects the appropriate recovery
action logic depending on the identity of the failing
component. If recovery is successful, the CLM 2
notifies using node data link control 20 to restart
and sends a notification to the host system 1
identifying the defective component. The CLM 2
updates the LCCM 4 with any changes. This
includes flagging defective components, removing
backup components from the backup pool, and
adding backup components to the path. If recovery
is unsuccessful, or if no recovery action is defined,
the CLM 2 sends an alert to the host system 1
indicating that recovery was attempted but failed,
and notifies data link control 20 to send an INOP
notification.
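
The recovery bookkeeping described above can be sketched in Python as follows. This is a simplified illustration, not the patented procedure: the function name, its parameters and the three callback hooks are hypothetical, and the sketch treats "a spare is available" as equivalent to "recovery succeeds", which a real system could not assume.

```python
def recover(path, backup_pool, defective, failing,
            restart_dlc, alert_host, send_inop):
    """Swap a failing component for a spare, or fall back to INOP.

    path:        list of component names making up the link (mutated in place)
    backup_pool: dict mapping a component name to a list of spare names
    defective:   set collecting the names of components flagged as defective
    The three callables are hypothetical hooks standing in for data link
    control and the host system.
    """
    spares = backup_pool.get(failing, [])
    if not spares:
        # No recovery action is possible: alert the host, then release INOP.
        alert_host(f"recovery attempted but failed for {failing}")
        send_inop()
        return False

    spare = spares.pop(0)               # remove the spare from the backup pool
    path[path.index(failing)] = spare   # add the spare to the path
    defective.add(failing)              # flag the defective component
    restart_dlc()                       # tell data link control to restart
    alert_host(f"{failing} defective, recovered using {spare}")
    return True
```

The three mutations mirror the LCCM updates named in the text: flagging the defective component, removing the backup from the pool, and adding it to the path.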

Service system 3 provides an intermediate
control and translation facility and will generally be
in the form of a program which is executed to
provide specific diagnostic processes. The service
system is comprised of one or more link connection
subsystem managers (LCSMs). Each LCSM 3
is responsible for a specific link segment of the link
connection subsystem. It sends commands to the
link connection component 10 and receives both
solicited and unsolicited responses. Each LCSM 3
is capable of managing its link segment independently
of any other LCSM. When commanded by
CLM 2, the service system 3 generates
device-specific requests and inquiries into the various link
connection components (LCCs) 10 and logical links
9 included in target system 5 of Figure 1. A
particular using node A is identified as numeral 6
within the overall communications link which links
the using node 6 with a target station 7. Overall,
the node 6, the station 7 and the interconnecting
communication components 10 and links 9 may be
comprised of any components and constitute a
"target system" for which automatic problem
diagnosis and non-disruptive recovery is desired in
the context of the present invention.

Figure 1 illustrates the basic scheme, architectural
design and logical flow of the preferred
embodiment of the invention. In general, communication
link manager 2 needs to be able to send
diagnostic requests to and receive responses from
the intermediate translation and control facility to
get information about the logical link connections.
In the invention, the CLM 2 accesses the configuration
data base 4 to determine the present physical
configuration of the specific link to the target system
node or station for which problem determination
is required. The CLM 2, armed with
device-specific data and the physical network configuration
and interconnection identifications showing the
physical components making up a given
communications link, issues problem determination
requests to LCSM 3. LCSM 3 is then enabled to
issue device-specific commands to the LCCs making
up the given link for which problem determination
is required. The LCSM 3 generates appropriate
device-specific diagnostic and control commands
to be sent to the physical target system devices to
analyze their condition and to analyze the data
responses returned from the LCCs 10. The LCSM
3 relays device-specific replies to CLM 2 which, in
turn, generates either additional inquiries or the
appropriate recovery procedure to the LCSM 3.
After the recovery action is completed, CLM 2
directs the using node data link control 20 to restart
the application and notifies the host 1 of the problem
and successful recovery.

As alluded to earlier, target system link con- 
nections may be made up of multiple layers herein 
called link subsystems comprising various types of 
physical devices, device capabilities, and commu- 
nication protocols all of which are usually provided 
by diverse vendors. As information and data
communication networks of the type shown in Figure 1
are constructed by users, they generally grow to 
support multiple protocols including IBM system 
standard SNA and non-SNA protocols. The sys- 
tems usually contain multiple logical networks and 
may incorporate one or more physical networks.
The data communications networks so constructed 
not only provide for the transportation of data in the 
normal systems today, but they may also include 
analog or voice information which has been suit- 
ably digitized for communications and control pur- 
poses. These factors all compound the problem of 
diagnosis for any specific problem isolation in the 
network, and particularly for isolating faulty compo- 
nents in a given link. 

Management of the link connection requires its 
own hierarchical structure for commands and data 
flow. The physical link connection components 10 
are the LCCs, LCC-1 through LCC-N shown in 
Figure 1. Together the LCCs form physical link 
segments 9 between each LCC and the next LCC,
node or station. The segments 9 provide the overall
physical link connection between a using node 6 
and a remote node 7. The logical SNA view of the 
link does not always reflect the fact that physical 
and hierarchical structural changes may occur or 
that various physical devices actually exist to make 
up the physical link segments 9. 

The actual physical link is the connection or
components that a user has assembled forming the
logical link between two stations 6 and 7. In Figure
1, using node A and remote node B could be SNA
or non-SNA nodes. Both node A and node B contain
a station that wishes to communicate with the
other station through the physical link connection.
The link connection is composed of some arbitrary
number of physical connection components. Each
of the physical components 10 is referred to as a
link connection component, LCC. Each LCC receives
signals on its physical interconnection to the
link and then passes them to another LCC or
eventually to a node. The LCC is the smallest
addressable element in the link connection. It can
sometimes be addressed through a data link protocol
of general usage or may require a specialized
format. The addressability allows the LCSM 3 to
address each LCC 10 using the correct protocol
and physical capabilities that have been passed to
it by CLM 2 from the data stored in the LCCM 4.
The LCSM may then send requests for data to
specific LCCs 10 in a format or protocol
understandable to them. The LCSM 3 then receives the
responses from the LCCs 10 and interprets them.

The communication path segment connecting
two LCCs 10 is identified as a link segment 9 in
Figure 1. Each link segment 9 is a portion of a
communication path that may include copper wire,
coaxial cable, optical fiber, microwave links or other
satellite or radio communication components. In
addition, a link segment 9 may contain other LCCs
10 and other link segments 9. The definition thus
given is generic. Each LCC reports to an assigned
link connection subsystem manager, LCSM 3. A
LCSM 3 is thus responsible for one or more specific
LCCs and the link segments between them. A
LCSM 3 may address its assigned collection of link
connection components 10 and is responsible for
the management of the components themselves.
The collection of components addressed by a
LCSM 3 is referred to as the link subsystem for
logical analysis.

A typical using node 6 as shown in Figure 1
may be a communications controller or a
communications system having an integrated communication
adapter. Either type of device incorporates
program controlled self-diagnostic capabilities for
determining whether the node itself is the cause of
a problem or whether the problem is in the
communications link attached to it. The node 6 is
assigned the responsibility for detecting link
connection problems. This is usually accomplished by
inference from the node's inability to send or
receive, or by the detection of an extensive number
of retransmissions being required on that link. The
using node 6 has two functions to perform in the
event that a problem is detected. First, node 6
must perform an elementary probable cause analysis
of each problem to determine whether the problem
is due to a failure in the node itself or whether
it lies in the link connection. When a failure is
discovered within a node itself, the node runs its
own internal diagnostic routines to isolate the
problem for reporting. An improperly functioning
communications adapter, i.e., a line driver,
internal modem or the like, might be found in such
a test. When the problem is not within the node 6
itself, it is assumed to lie somewhere within the link
connection components attached to the node. The
node then reports this link connection failure to the
communication link manager, CLM 2. An IBM 3725
communications controller is a typical example of a
system that is capable of diagnosing faults in itself
or identifying that the fault is not within itself but
within the communication link connection instead.
These devices have the capability of reporting an
error condition to CLM 2 together with any associated
data that is normally maintained such as
traffic counts, error counts and adapter interface
status conditions that may be reported.

It is the responsibility of CLM 2, when notified
of a problem by using node 6, to determine the
communications link connection configuration for
the link between node 6 and node 7 as shown in
Figure 1. This information will be obtained from the
LCCM 4 as will be described below.

The LCSM 3 is responsible for managing its 
assigned link subsystem. It receives notification of 
potential problems from the CLM 2. The LCSM 3 is 
the component in the system that must send commands
to the LCCs 10 and which receives their
responses for forwarding on to CLM 2. The LCSM 
obtains from the CLM 2 the present connection 
configuration identifying the elements in the given
communication link together with the addresses of 
the individual LCCs 10 with which it will commu- 
nicate. Neither CLM 2 nor LCSM 3 contains the 
configuration map of the entire link connection. 

The LCSM 3 receives commands or inquiries
from the CLM 2 in the form of generic requests
and responds to them.
The requests from the CLM 2 are directed to a
given LCSM in response to indications of a problem
being reported by a given node 6. The CLM 2
will determine which LCSM 3 has been assigned
the management function for determining problems
within the identified links connected to the node 6.
The CLM 2 issues problem determination requests
to the LCSM 3 that is assigned to manage a given
communications link and will identify the target
node 7 for problem determination.

The CLM 2 provides LCSM 3 with information 
derived from LCCM 4 on the communications link 
configurations and addresses of devices constitut- 
ing the link. The LCSM 3 is also provided with the 
appropriate protocol or communication method for 
each identified specific LCC 10. The LCSM 3 then 
generates a series of inquiries to implement testing 
of the communication link between nodes 6 and 7 
by issuing diagnostic commands to the individual 
LCCs 10. LCSM 3 will eventually report back to the
CLM 2 that a failure has been detected and isolated
to a specific LCC 10, that a failure has been
detected but not isolated to a specific component,
or that no trouble has been found on the link.

If LCSM 3 isolates the problem, it will send a
response to CLM 2 indicating the specific LCC 10
or connection to a specific component which is the
source of the problem. Because of its limited
capabilities, the LCSM 3 does not determine the probable
causes of error for the entire link connection
that may exist but only for its own assigned LCCs
10. Other LCSMs may also be involved in determining
probable cause error conditions for still other
LCCs in a given communications link between
two nodes. The CLM 2 must be able to send
requests for tests to multiple LCSMs 3 in such a
system configuration in order to determine fully
what problem is occurring and which element is
responsible.
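
One way to picture the CLM consulting several LCSMs is the Python sketch below. The three outcomes come from the description (isolated, detected but not isolated, no trouble found); the `Outcome` enum, the `determine_problem` function and the representation of each LCSM as a callable are assumptions made purely for illustration.

```python
from enum import Enum


class Outcome(Enum):
    ISOLATED = "failure isolated to a specific component"
    NOT_ISOLATED = "failure detected but not isolated"
    NO_TROUBLE = "no trouble found"


def determine_problem(lcsms):
    """Ask each LCSM in turn to test its own link subsystem.

    `lcsms` is a list of hypothetical callables, one per LCSM, each
    returning (Outcome, component_name_or_None). The loop stops at the
    first LCSM that isolates a failing component.
    """
    detected = False
    for test_subsystem in lcsms:
        outcome, component = test_subsystem()
        if outcome is Outcome.ISOLATED:
            return outcome, component
        if outcome is Outcome.NOT_ISOLATED:
            detected = True  # remember it, but keep asking other LCSMs
    return (Outcome.NOT_ISOLATED if detected else Outcome.NO_TROUBLE), None
```

Stopping at the first isolated component is a design choice of this sketch; the description leaves the order and extent of the tests to the CLM's configuration-specific logic.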

The LCCs 10 typically may be protocol converters,
computerized branch exchanges, time division
multiplexers, modems, statistical multiplexers,
front end line switches, local area network
controllers and the like. Each link connection
component 10 will perform specific functions that it is
capable of carrying out at the physical link layer
that it represents. These functions include digital to
analog signal conversion, typically performed by
modems; line multiplexing, normally performed by
multiplexers or some switching systems; and other
functions that affect the physical data transmission
layer. Each LCC 10 monitors its own operational
condition for its own link segment. At the occurrence
of failures, the LCC 10 affected may initiate
various tests to determine the causes of failure. A
given LCC 10 may attempt a recovery action when
a problem is detected by initiating internal self test
and/or by notifying its neighbors of a problem. The
LCC 10 may participate with its neighbors in problem
determination procedures which may include
performing wrap tests or line analysis tests such as
those carried out typically by some "intelligent"
modems. Each LCC 10 must be able to execute
and respond to diagnostic commands received
from its assigned managing LCSM 3. The commands
may cause functions to be performed such
as self-testing, status checking at interfaces, and
collection of various operating parameter settings
for the individual devices. The specific commands
that may be received from the LCSM 3 will
necessarily be in the proper format and/or protocol
required by the LCC 10. Since various LCCs have
differing capabilities and may implement different
command functions and responses, the physical
configuration of the individual links between nodes
served by a given LCSM 3 is maintained in the
LCCM 4.
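
The translation of a generic request into a device-specific command can be illustrated with a small lookup sketch in Python. Everything concrete here is invented for the example: the protocol names, the command strings and the `translate` function are hypothetical and are not taken from the original text or from any real device.

```python
# Hypothetical command tables, keyed by an invented protocol name.
# Each device type supports its own subset of the generic requests.
COMMAND_TABLES = {
    "vendor-x-modem": {
        "self_test": "XTEST",
        "status": "XSTAT",
        "wrap_test": "XWRAP",
    },
    "vendor-y-mux": {
        "self_test": "diag run",
        "status": "show status",
    },
}


def translate(generic_request, protocol):
    """Return the device-specific command for a generic request.

    Raises KeyError for an unknown protocol and ValueError when the
    device type does not implement the requested function.
    """
    table = COMMAND_TABLES.get(protocol)
    if table is None:
        raise KeyError(f"unknown protocol: {protocol}")
    if generic_request not in table:
        raise ValueError(f"{protocol} does not support {generic_request}")
    return table[generic_request]
```

The distinct error for an unsupported request reflects the point made above that LCCs differ in the command functions they implement.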

Summarizing the operation of the overall system
design just discussed, it may be seen that the
structure as described in Figure 1 exists conceptually
for each link connection. At a system level,
however, the CLM 2 incorporates processes that
enable it to handle problem determination and
non-disruptive recovery for all segments of all the link
connections incorporated within a link subsystem
or target subsystem between a using node 6 and a
remote node 7 as shown in Figure 1. This type of
design allows the CLM 2 and LCSMs 3 to be
developed to support any type of communications
protocol or command environment; thus either SNA
or non-SNA link connection components may be
accommodated easily. The information that flows
between the CLM 2 and both the LCCM 4 and the
LCSMs 3 is designed to provide a generic support
capability that allows any type of product LCC 10
to be supported. Any product unique commands,
protocols or requirements are generated by the
LCSM 3 in response to the information it receives
from CLM 2 regarding the physical configuration of
the link from the LCCM 4. All of the necessary
problem determination and recovery processes are
controlled by the CLM 2, which directs the LCSM 3
to perform a sequential series of analyses on each
LCC 10 until the problem is isolated and identified,
at which time CLM 2 initiates a selected recovery
procedure via the appropriate LCSM 3.

Each of the LCCs 10 must be addressable 
from the corresponding LCSM 3 in order to receive 
commands, execute tests, gather data or transmit 
information. The way in which this is accomplished 
depends upon many factors including the location 
of the LCSM 3 relative to the given LCC 10 in the 
network, the type of communication path available 
for communication, and the type of data link control 
protocol that is utilized. The many differences in 
the communication factors mentioned for each 
LCSM 3 make it unlikely that a single addressing 
structure for all LCCs 10 may be implemented. 
However, within the link connection subsystem 
which exists for a given target system between 
nodes 6 and 7, there should be a consistent means
of alerting the host system 1 of the conditions of all 
the LCCs that comprise the link. This is accom- 
plished by this invention which is capable of send- 
ing to the host system 1 the present physical 
connections making up the link together with ge- 
neric responses identifying which element or ele- 
ments have been isolated as being the source of 
the problem. 

Illustrated in Figure 2 are the different functional
layers comprising data link control 20 (i.e., link
connection manager of Figure 1A) at the using
node 6. Logical Link Control 23 provides the logical
appearance of the link to the host. For example, in
SNA a link will have a logical name that maps to a
Network Addressable Unit (NAU) which is the SNA
logical address for the link. Inserted in data link
control 20 between the logical link control layer 23
and the physical link control layer 28 are the
physical link error detection and recovery layer 26
and the CLM Interface 24. Physical Link Error
Detection and Recovery 26 is responsible for
detecting errors on the physical link and performing
recovery for the specific error until a predetermined
threshold is exceeded. CLM Interface 24 allows data
link control to pass control to a higher level (CLM 2)
when the predetermined Physical Link Error Recovery
threshold has been exceeded. The CLM interface
24 interrupts data link control 25 without the
communication sessions between using node 6 and
remote node 7 going down. Thus, the failure is
automatically bypassed while recovery is attempted.
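
The threshold behaviour of layer 26 and the hand-off through CLM Interface 24 might be sketched as follows. This is an illustrative assumption only: the class name, the simple error counter and the `escalate` callback are not taken from the disclosure, which does not specify how the threshold is counted.

```python
class PhysicalLinkRecovery:
    """Sketch of layer 26: recover locally, escalate once the threshold is exceeded."""

    def __init__(self, threshold, escalate):
        self.threshold = threshold    # predetermined error recovery threshold
        self.escalate = escalate      # hypothetical stand-in for CLM Interface 24
        self.error_count = 0

    def on_error(self, event):
        self.error_count += 1
        if self.error_count > self.threshold:
            # Pass control to the higher level (CLM 2); sessions are not taken down.
            self.escalate(event)
            return "escalated"
        return "retried"


escalated = []
layer = PhysicalLinkRecovery(threshold=2, escalate=escalated.append)
results = [layer.on_error(f"timeout-{i}") for i in range(3)]
print(results, escalated)
```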

Figure 3 illustrates an automatic problem
determination and recovery flow chart showing how the
communication link manager 2 conducts the
diagnostic processes and communicates with the
LCSMs 3 to coordinate recovery of the failed link
components 10. The using node DLC 20 detects
link connection problems and first determines if the
problem is with a node itself or with the link
connection. In the latter case, the using node DLC 20
passes link event data to the CLM 2 and holds the
INOP signal. Upon receiving the link event data,
the CLM 2 requests configuration data from LCCM
4 which provides CLM 2 with the complete link
subsystem configuration data, including all LCCs
10 in the link connection path, the LCSM 3
responsible for specific segments of the link, and the
backup components for recovery.

The CLM 2 then initiates link problem
determination by selecting the appropriate problem
determination logic depending on the configuration
returned from LCCM 4 and invokes the appropriate
LCSM 3 for application analysis. LCSM 3 then tests
the specific link segment of the link connection
subsystem for which it is responsible. The LCCs 10
in the link segment determine if a problem is
caused by an internal error or with the link
segment. If the problem is internal, the LCC 10 passes
this information to LCSM 3. If the error is with the
link segment, the LCC 10 collects the link event
data and passes this data to LCSM 3.

Having received the failure data from an LCC
10, the LCSM 3 sends a response to CLM 2
identifying the failed component. This completes
the problem determination stage. With knowledge
of both the link configuration and the failed
component, the CLM 2 sends a request to the LCSM 3 to
initiate a specific recovery action (e.g., swap ports).
LCSM 3 then sends a command to the appropriate
LCC 10 to execute the recovery action. After the
procedure is completed, LCC 10 notifies LCSM 3
which, in turn, signals CLM 2 that recovery is
complete. CLM 2 next informs using node control
22 of the port swap and receives an
acknowledgement in return. CLM 2 then directs the using node
data link control 20 to restart the link. Using node
DLC 20 advises CLM 2 when restart has been
completed. Finally, CLM 2 sends a notification to
the host system 1 that a failure has occurred and
that recovery has been successful.
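
The message sequence of Figure 3 for a successful recovery can be traced with a short Python sketch. The function name, the three callables and the dictionary layout of the configuration data are assumptions made for illustration; the disclosure does not define message formats.

```python
def run_figure3_sequence(get_config, test_segment, execute_recovery):
    """Return the ordered message log for one successful recovery.

    All three callables are hypothetical stand-ins: get_config() plays
    LCCM 4; test_segment() and execute_recovery() play LCSM 3.
    """
    log = ["DLC 20 -> CLM 2: link event data (INOP held)"]
    config = get_config()
    log.append("LCCM 4 -> CLM 2: link subsystem configuration")
    failed = test_segment(config["lccs"])
    log.append(f"LCSM 3 -> CLM 2: failed component is {failed}")
    action = config["recovery"][failed]       # e.g. swap ports
    execute_recovery(failed, action)
    log.append(f"LCSM 3 -> CLM 2: {action} complete")
    log.append("CLM 2 -> using node control 22: port swap notice")
    log.append("CLM 2 -> DLC 20: restart link")
    log.append("CLM 2 -> host 1: failure occurred, recovery successful")
    return log


config = {
    "lccs": ["FELS 31", "STAT MUX 34"],
    "recovery": {"STAT MUX 34": "swap ports"},
}
executed = []
log = run_figure3_sequence(
    get_config=lambda: config,
    test_segment=lambda lccs: lccs[-1],   # pretend the last LCC failed
    execute_recovery=lambda lcc, act: executed.append((lcc, act)),
)
print("\n".join(log))
```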

Illustrated in Figure 4 are some typical link
connection components of a computer network
which were referred to previously in a generic
sense as LCCs 10 in Figure 1. Also shown are the
connecting links which physically connect the
various LCCs and which were identified by reference
numeral 9 in Figure 1. The network has a host
system 1 connected to a communications controller
11 which may contain the using node 6. Connected
to the communications controller 11 is digital front
end line switch 31 (FELS(DS)). The output of front
end line switch 31 is connected to statistical
multiplexer (STAT MUX) 34 whose output ports are
connected to digital front end line switch 32. The
output ports from FELS(DS) 32 are shown
connected to modem 42, data service unit (DSU) 43,
and time division multiplexer (TDM) 36. The
outputs from modem 42 and DSU 43 are routed to
analog front end line switch 33 (FELS(AS)) for
transmission over a line to modem 44 which
passes the received data on to station 46. The
output side of TDM 36 is routed via a T1 line to
TDM 38 whose output is passed to station 48 via
statistical multiplexer 40. Also shown in Figure 4
are two computerized branch exchanges (CBX) 52,
54 linked to TDMs 36 and 38 respectively. The
links and link connection components for which
recovery from a failure is possible are indicated by
the P + R label and arrow drawn to the link or
component. The label P indicates that problem
determination is possible but not automatic
recovery from failure. This latter situation applies to
modems, links and stations at the remote location.

The different configurations of devices
supported by this invention are indicated in Figure 5.
The column headings identify the LCSM 3 having
responsibility for the particular LCC 10. For
example, configuration 4 includes a using node which is
managed by LCSM-1, a front end line switch
managed by LCSM-2, a pair of statistical multiplexers
managed by LCSM-3 and a pair of time division
multiplexers managed by LCSM-4. The actual
physical layout corresponding to this particular
configuration is illustrated in Figure 6. The LCSM
responsible for the statistical multiplexers, i.e.,
LCSM-3, is also responsible for problem isolation
of the modems 45, 47 downstream from the STAT
MUX 40. This is true in general, and in
configurations having TDMs but not having STAT MUXs,
such as configurations 3 and 7 in Figure 5, the
LCSM responsible for the TDM is also responsible
for problem isolation of the modems downstream
from the TDM. In the absence of either STAT
MUXs or TDMs, the using node 6 will be
responsible for problem isolation of modems.
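
The rule for who isolates modem problems in a given configuration reduces to a simple precedence, sketched below in Python. The function name and the dictionary that maps a device type to its managing LCSM are illustrative assumptions; only the precedence itself (STAT MUX, then TDM, then using node) comes from the text above.

```python
def modem_isolation_owner(lcsm_by_device):
    """Pick who isolates modem problems for one configuration.

    lcsm_by_device maps a device type to its managing LCSM, for example
    {"STAT MUX": "LCSM-3"}. This layout is an assumption of the sketch.
    """
    if "STAT MUX" in lcsm_by_device:
        return lcsm_by_device["STAT MUX"]
    if "TDM" in lcsm_by_device:
        return lcsm_by_device["TDM"]
    return "using node 6"


# Configuration 4 as described for Figure 6.
config_4 = {
    "using node": "LCSM-1",
    "FELS": "LCSM-2",
    "STAT MUX": "LCSM-3",
    "TDM": "LCSM-4",
}
print(modem_isolation_owner(config_4))
print(modem_isolation_owner({"using node": "LCSM-1", "TDM": "LCSM-4"}))
print(modem_isolation_owner({"using node": "LCSM-1"}))
```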

In Figure 6, LCSM-2 is responsible for problem
isolation of the path between its link segment entry
and exit points. The entry point is the input port to
FELS 31; the exit point is the output port from
FELS 31. However, LCSM-2 is not responsible for
problem isolation of the modems, TDMs, or STAT
MUXs on the path attached to the FELS 31.

LCSM-3 is responsible for problem isolation of
the path between the link segment entry point at
the input port on STAT MUX 34 to the link
segment exit point at the output port on STAT MUX
40. LCSM-3 is also responsible for problem
isolation of the modems 46, 47 downstream from STAT
MUX 40. However, LCSM-3 is not responsible for
problem isolation of the TDMs 36, 38 located
between STAT MUXs 34 and 40.

LCSM-4 is responsible for problem isolation of
the path between the link segment entry point at the
input port on TDM 36 to the link segment exit point
at the output port on TDM 38.

For each supported configuration, there is an
algorithm which is used by CLM 2 to isolate and
resolve the link problem. The algorithm
corresponding to the physical layout depicted in Figure
6 is provided by the logic chart of Figures 7A and
7B. The other algorithms are similar in structure
and one skilled in the art could easily derive them
from the framework depicted in Figures 7A and 7B
which illustrate the most complex configuration
among the ones supported.

Once a problem has been isolated, CLM 2
invokes the appropriate recovery procedure to
recover from the problem. If recovery is successful,
CLM 2 notifies the host that a failure was detected
and that recovery was successful. The failing
component is identified and the configuration data base
(LCCM 4) is updated to reflect the current
configuration. If recovery is not successful, the CLM 2
notifies DLC 20 to INOP the resource and sends
notification to the host 1 indicating the failing
resource and that recovery was attempted but failed.
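
The two outcomes of a recovery attempt can be sketched in Python as follows. The function name, the dictionary standing in for the configuration data base of LCCM 4, and the callbacks for the host and DLC 20 are all hypothetical; the sketch only illustrates the bookkeeping the paragraph above describes.

```python
def conclude_recovery(succeeded, failed_lcc, spare_lcc, config_db,
                      notify_host, dlc_inop):
    """Apply the success or failure path after a recovery attempt."""
    if succeeded:
        # Update the data base so it reflects the current configuration.
        link = config_db["link"]
        link[link.index(failed_lcc)] = spare_lcc
        config_db["spares"].remove(spare_lcc)
        config_db["failed"].append(failed_lcc)
        notify_host(f"failure detected on {failed_lcc}; recovery successful")
    else:
        dlc_inop(failed_lcc)          # DLC 20 marks the resource inoperative
        notify_host(f"{failed_lcc} failing; recovery attempted but failed")


db = {"link": ["port A1", "STAT MUX 34"], "spares": ["port A2"], "failed": []}
host_messages, inop_resources = [], []

conclude_recovery(True, "port A1", "port A2", db,
                  host_messages.append, inop_resources.append)
conclude_recovery(False, "STAT MUX 34", None, db,
                  host_messages.append, inop_resources.append)
print(db)
print(host_messages)
```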

Referring now to Figures 7A and 7B, the first
logical test as indicated by block 102 is to
determine if there is a problem at the using node - local
modem interface. If there is, the digital front end
line switch 31 is reconfigured to perform a wrap
test. The data link control 20 is requested to wrap
the port on the FELS 31. The wrap test is
performed next as indicated in logic block 106.

If the wrap test in logic block 106 is successful,
the problem has been narrowed to the local
statistical multiplexer 34, and the appropriate recovery
procedure is to swap the identified port as
indicated in block 108. The recovery function of CLM
2 determines the actual ports that have to be
swapped from the configuration data base
maintained by LCCM 4 and which port to use as a
spare. When the CLM 2 receives a response from
the FELS application 31 that the digital switching is
complete and from the STAT MUX application 34
that the port was swapped, it notifies DLC 20 to
restart the line.

If the wrap test in block 106 is not successful,
the problem has been narrowed to either the FELS
or line interface control (LIC) from communications
controller 11, and the appropriate recovery
procedure is to swap the identified ports between the
FELS 31 and communications controller 11. The
CLM 2 determines the actual ports to be swapped
from the configuration data base maintained by
LCCM 4 and which port to use as a spare. When
the CLM 2 receives a response from the FELS
application 31 that the digital switching is complete
and that the port has been swapped, it notifies data
link control 20 to restart the line.

If in block 102 the using node - local modem
interface is not failing, the link connection between
the using node and the remote stations is checked
to determine if a single station or if all stations
are failing. If the logic test indicated by block 112
indicates that only a single station is failing, the
CLM 2 will determine the actual ports that have to
be used from the configuration data base maintained
by LCCM 4 and the phone numbers to use for
switched network backup (SNBU). The CLM 2 notifies
the STAT MUX application to dial and restart the
line. If in logic block 112 there is an indication
that all stations are failing, the next test is
performed in logic block 114 to determine the status
of the STAT MUXs. If the result of the logic test in
block 114 is that there is a failure on the link
segment containing the STAT MUXs, a series of tests
is performed to determine if the failure is between
the STAT MUXs (logic block 116), if the failure is
downstream of the link segment (logic block 118),
or if the failure is upstream of the link segment
(logic block 120).
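
The first decisions of the logic chart (blocks 102, 106 and 112) can be summarized in a short Python sketch. The function name and its boolean parameters are hypothetical; the returned strings paraphrase the recovery actions named in the text, and the sketch stops where the chart moves on to the STAT MUX status test of block 114.

```python
def first_level_dispatch(interface_failing, fels_wrap_ok, single_station_failing):
    """Return the action chosen by blocks 102, 106 and 112 of Figure 7A."""
    if interface_failing:                              # block 102
        if fels_wrap_ok:                               # block 106
            return "swap port on local STAT MUX 34"    # block 108
        return "swap ports between FELS 31 and communications controller 11"
    if single_station_failing:                         # block 112
        return "switched network backup (SNBU)"
    return "test STAT MUX status (block 114)"


print(first_level_dispatch(True, True, False))
print(first_level_dispatch(True, False, False))
print(first_level_dispatch(False, False, True))
print(first_level_dispatch(False, False, False))
```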

If in logic block 118 the failure is determined to
be downstream of the STAT MUXs, the CLM 2
uses remote switched network backup (SNBU) to
recover from the failure. If in logic block 120 the
failure is found to be upstream of the link segment
between STAT MUX 34 and STAT MUX 40, the
problem is with either the local STAT MUX 34,
front end line switch 31 or line interface control.
The problem is isolated by reconfiguring the FELS
31 to do a wrap test and requesting data link
control 20 to wrap the port on the FELS 31. If the
wrap is successful, the port is swapped on the
local STAT MUX 34. Otherwise, the port on using
node 6 is swapped.

If the STAT MUX test performed in logic block
114 indicates no failure, then the TDMs are tested
as shown in logic block 132 in Figure 7B. Logic
block 126 also tests the status of the TDMs. If in
either logic block 126 or logic block 132, the status
of the TDMs is failure, then CLM 2 tries to
ascertain if the failure is between TDM 36 and TDM
38 (logic block 134), downstream of the link
segment between the TDMs (logic block 136) or
upstream of the link segment between TDMs (logic
block 138).

If in block 134 the failure is identified as being
on the link segment between the TDMs, then the
appropriate recovery procedure is to activate an
alternate path between the TDMs. The CLM 2
determines the actual ports to use. If in block 136
the failure is determined to be downstream of the
link segment between the TDMs, CLM 2 uses
remote switched network backup (SNBU) to
attempt to recover from this failure. If in block 138
the failure is determined to be upstream of the link
segment between the TDMs, CLM 2 swaps ports
between the using node and STAT MUX. The CLM
2 determines the actual ports to swap from the
configuration data base 4. When the CLM 2
receives a response from the STAT MUX application
that the ports were swapped, it notifies DLC 20 to
restart the line.
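
The three TDM outcomes map directly onto three recovery actions, which a lookup table expresses compactly. The enumeration and table names below are hypothetical; the member values reuse the logic block numbers 134, 136 and 138 only as labels.

```python
from enum import Enum


class TdmFailureLocation(Enum):
    """Where the failure lies relative to the TDM link segment."""
    BETWEEN_TDMS = 134
    DOWNSTREAM = 136
    UPSTREAM = 138


TDM_RECOVERY = {
    TdmFailureLocation.BETWEEN_TDMS: "activate alternate path between TDMs",
    TdmFailureLocation.DOWNSTREAM: "remote switched network backup (SNBU)",
    TdmFailureLocation.UPSTREAM: "swap ports between using node and STAT MUX",
}


def tdm_recovery(location):
    """Return the recovery action CLM 2 would select for a TDM failure."""
    return TDM_RECOVERY[location]


for location in TdmFailureLocation:
    print(location.value, tdm_recovery(location))
```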

If the TDM status test in block 126 indicates
that the TDMs have not failed, the appropriate
recovery procedure is to activate an alternate path
between the STAT MUXs and the TDMs. The CLM
2 determines the actual ports to swap from the
configuration data base maintained by LCCM 4.

If the TDM status test in block 132 indicates
that the TDMs have not failed, then CLM 2
requests DLC 20 to wrap the remote STAT MUX 40
as indicated in logic block 148. If the wrap test
performed in block 150 on the remote STAT MUX
40 is successful, the appropriate recovery action is
to use remote switched network backup (SNBU) to
recover from the failure. The CLM 2 determines the
actual ports to use and the phone numbers to use
for SNBU. It notifies the STAT MUX application to
dial and restart the line.

If the wrap test on the remote STAT MUX is
not successful, then CLM 2 requests DLC 20 to
wrap the local STAT MUX 34 as indicated in logic
block 154. If the wrap test performed on the local
STAT MUX is successful (logic block 156), the
problem is between the local and remote STAT
MUXs and the appropriate recovery procedure is to
activate an alternate path between the STAT MUXs
and TDMs. The CLM 2 determines the actual ports
to use from the configuration data base maintained
by LCCM 4. If the wrap test performed on the local
STAT MUX 34 is not successful, then CLM 2
reconfigures the front end line switch 31 to perform
a wrap test. CLM 2 requests DLC 20 to wrap
front end line switch 31. If in block 160 the wrap
test on the front end line switch 31 is successful,
the problem is with the local STAT MUX 34 and
the appropriate recovery procedure is to swap the
port on the local STAT MUX 34. The CLM 2
determines the actual ports to swap from the
configuration data base 4. When the CLM 2 receives a
response from the FELS application that the digital
switching is complete and from the STAT MUX
application that the port has been swapped, it
notifies DLC 20 to restart the line. On the other
hand, if the wrap test on the front end line switch
in block 160 is unsuccessful, the problem is with
FELS 31 or the line interface control. The
appropriate recovery procedure is to swap the port on
the using node. The CLM 2 determines the actual
ports to swap from the configuration data base
maintained by LCCM 4. When the CLM 2 receives a
response from the FELS application that digital
switching is complete and that the port has been
swapped, it notifies DLC 20 to restart the line.
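
The cascade of wrap tests in blocks 148 through 160 works from the remote end of the link back toward the using node, stopping at the first wrap that succeeds. A minimal Python sketch follows; the function `wrap_cascade` and the `wrap` callable (which stands in for a request to DLC 20 and returns True on a successful wrap) are assumptions made for illustration.

```python
def wrap_cascade(wrap):
    """Select a recovery action from successive wrap tests (blocks 148-160).

    wrap(device) is a hypothetical callable that performs one wrap test
    and returns True if it succeeds. Tests run only as far as needed.
    """
    if wrap("remote STAT MUX 40"):                     # blocks 148, 150
        return "remote switched network backup (SNBU)"
    if wrap("local STAT MUX 34"):                      # blocks 154, 156
        return "activate alternate path between STAT MUXs and TDMs"
    if wrap("FELS 31"):                                # block 160
        return "swap port on local STAT MUX 34"
    return "swap port on using node"


# Assumed test outcomes for the example: only the FELS wrap succeeds.
outcomes = {"remote STAT MUX 40": False, "local STAT MUX 34": False, "FELS 31": True}
calls = []


def wrap(device):
    calls.append(device)
    return outcomes[device]


action = wrap_cascade(wrap)
print(calls, action)
```

Passing a callable rather than precomputed results keeps the tests lazy, so a wrap of the FELS is requested only after both STAT MUX wraps have failed, as in the flow described above.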

While the invention has been described with
respect to a preferred embodiment, it will be
apparent to those skilled in the art that various
implementations of the invention may be envisioned
that would relocate the functional capacities of the
CLM 2, the LCSM 3, the LCCM 4 and the various
LCCs 10 without departing from the spirit and
scope of the invention. Therefore, what is
described and what is intended to be protected by
Letters Patent is set forth in the claims which follow
by way of example only and not as limitation.



Claims 

1. A method for automatic non-disruptive
problem determination and recovery of communication
link problems between a using node (6) and a
remote node (7) in a data communication network
characterized in that it comprises the steps of:
detecting the communication link problem at said
using node,
accumulating link event data at said using node
and passing said link event data to a
communication link manager (2),
responsive to said link event data, sending a
request from said communication link manager to a
link configuration manager (4) for link subsystem
configuration data on said communication link and
returning said configuration data to said
communication link manager,
selecting the problem determination logic for the
configuration returned from said link configuration
manager and invoking a link connection subsystem
manager (3) for application analysis,
conducting testing of the link segment of said
communication link for which said link connection
subsystem manager is responsible,
determining the failed component on said
communication link by said link connection subsystem
manager and identifying said failed component to
said communication link manager, and
responsive to the receipt of said identified failed
component, initiating the recovery action by said
communication link manager.

2. The method of claim 1 wherein the step of
detecting the communication link problem at a
using node (6) includes the steps of determining if
the problem is internal to the using node, and if the
problem is not internal to the using node,
determining if the problem is with the link connection.

3. The method of claim 1 or 2 wherein the step
of returning said configuration data to said
communication link manager (2) includes identifying the
link connection components on said communication
link and the physical connections between said link
connection components, identifying the link
connection subsystem manager (3) responsible for
each link connection component, and identifying
the backup components and physical connections
to use for non-disruptive recovery.

4. The method of claim 1, 2 or 3 further
including the step of determining if at least one more link
connection subsystem manager (3) should be
invoked for application analysis.

5. The method of any one of claims 1 to 4
wherein the step of initiating a recovery action
includes determining if a recovery action is defined
for the failed component.

6. The method of any one of claims 1 to 5
further including updating the data base maintained
by said link configuration manager (4) after said
recovery action is complete.

7. The method of claim 6 wherein the step of
updating the data base includes flagging the
defective component, removing a backup component
from the pool maintained by said link configuration
manager (4), and adding said backup component
to the communication link.

8. The method of any one of claims 1 to 7
further including the steps of notifying data link
control to restart the communication link and
sending a notification to the host system identifying the
failed component if said recovery action is
successful.

9. The method of any one of claims 1 to 8
further including the steps of sending an alert to
the host indicating that a recovery action was
attempted but failed and notifying data link control to
send an inoperative signal.

10. The method of any one of claims 1 to 9
further including the step of interrupting data link
control at said using node (6) in order to suspend
the communication sessions between said using
node and said remote node (7) until the
communication link problem is isolated and resolved.

11. A system for automatic non-disruptive
problem determination and recovery of
communication link problems between a using node (6)
and a remote node (7) in the data communication
network comprising:
a communication link manager (2) connected to
said communication network for coordinating the
testing of link components, analyzing the results of
said testing and invoking a recovery procedure,
first means at said using node for detecting a
communication link problem,
second means at said using node for accumulating
link event data and passing said link event data to
said communication link manager,
a link configuration manager (4) cooperating with
said communication link manager for maintaining a
configuration data base and providing link
configuration data, said configuration data including the
identity of the link connection components (10) on
said communication link and the physical
connection between components, and the identity of
backup components and the physical connections to
use for recovery from a communication link
problem, and
a plurality of link connection subsystem managers
(3) connected to said communication network
cooperating with said communication link manager for
conducting the testing of assigned segments of
said communication link, identifying a failed link
component to said communication link manager,
and issuing recovery commands for said failed link
component under the direction of said
communication link manager.

12. The system of claim 11 including interface
means (22) at said using node (6) for interrupting
the data link control at said using node in order to
suspend the communication sessions that are
active over the communication link between said
using node and said remote node (7) until the
recovery procedure is completed.